ABSTRACT

To assist security analysts in obtaining information pertaining to
their network, such as novel vulnerabilities, exploits, or patches,
information retrieval methods tailored to the security domain are
needed. As labeled text data is scarce and expensive, we follow
developments in semi-supervised Natural Language Processing and
implement a bootstrapping algorithm for extracting security entities
and their relationships from text. The algorithm requires little
input data, specifically a few relations or patterns (heuristics for
identifying relations), and incorporates an active learning component
that queries the user on the most important decisions to prevent
drifting from the desired relations. Preliminary testing on a small
corpus shows promising results, obtaining a precision of 0.82.

Categories and Subject Descriptors 

H.3.3 [Information Systems]: Storage and Retrieval

1. INTRODUCTION

The overall motivation behind this work is to aid security analysts
in finding and understanding information on vulnerabilities, attacks,
or patches that are applicable to their network. As discussed in
prior work, public disclosure of vulnerabilities and exploits often
occurs in obscure text sources, such as forums or mailing lists,
sometimes months before inclusion in databases such as the CVE or
NVD. Hence, tailored methods of automated information retrieval are
needed for immediate awareness of security flaws. Additionally, our
experience with analysts shows that necessary tasks, such as triage
and response to alerts, performing network forensics, and seeking
mitigation techniques, require security practitioners to find and
process a large amount of information that is external to, but
applicable to, their network. Such information resides in online
resources such as vulnerability and exploit databases, vendor
bulletins, news feeds, and security blogs, and is either unstructured
text or a structured database with text description fields. In short,
there is a critical need to aid security analysts in processing
text-based sources by providing automated information retrieval and
organization techniques. To address this, we seek entity and relation
extraction techniques to identify software vendors, products, and
versions in text along with vulnerability terms and the mutual
associations among these entities.

Relation extraction is the area of natural language processing (NLP)
that seeks to recover structured data from text sources in the form
of (subject entity, predicate relation, object entity) triples that
match a database schema. For example, from the sentence “Microsoft
has released a fix for a critical bug that affected its Internet
Explorer browser.”, we would like to extract (Microsoft,
is_vendor_of, Internet Explorer) as an instance of the (software
vendor, is_vendor_of, software product) relation. Our choices for
entity concepts and their mutual relations are driven by an ontology
of the security domain, which gives a schema for storage in a graph
database (see Sections 2.1 and 2.2), ideally making such information
easily accessible to analysts. As annotated textual data is scant and
costly to produce, we describe a semi-supervised technique for entity
and relation extraction that incorporates active learning (querying
the user) to assist in labeling only a few of the most influential
instances. While our implementation combines elements of many
previous bootstrapping systems in the literature, applying these
methods to relation extraction in cyber-security is, to the best of
our knowledge, novel. Preliminary testing on a small corpus gives
promising results; in particular, low false positive rates are
obtained. Ideally, this preliminary work will lead to a fully
developed operational version that automatically populates a
knowledge base of cyber-security concepts from pertinent text
sources.

2. BOOTSTRAPPING 

As is often the case when deploying NLP techniques in a specific
domain, the main hurdle is the lack of annotated data, which is
expensive to produce. Consequently, supervised means of information
extraction, although thoroughly developed, are not applicable. To
accommodate this constraint, we implement a semi-supervised approach
for relationship extraction that follows previous work in the
literature but is tailored to our needs. Our implementation builds on
Brin’s Dual Iterative Pattern Relation Expansion (DIPRE) algorithm
[4], which uses a cyclic process to iteratively build known relation
instances and heuristics for finding those instances. Input to the
algorithm is (1) a relatively small set of “seed” instances of a
given relation and/or seed patterns (heuristics for identifying a
relation instance, generally from the surrounding text) and (2) a
relatively large corpus of unlabeled documents that are believed to
contain many instances of the relation. The process proceeds by
searching the corpus for mentions of the few known seed instances;
upon finding an instance, the system automatically generates patterns
from the surrounding text and stores these heuristics. Next, the
corpus is traversed a second time using the patterns to identify any,
hopefully new, instances of the relation, which are then stored in
the seed set. To prevent the system from straying to undesired
relations and patterns, a common addition to the DIPRE process is a
method of confidence scoring for nominated patterns and relation
instances [3]. Our scoring procedure follows the BASILISK method of
Thelen & Riloff [18] and is discussed in Section 2.3. A variety of
works have made contributions to this basic bootstrapping process,
and we reference the improvements that have shaped our implementation
throughout the discussion. We note that this bootstrapping process
has been used solely for entity extraction in many works, in
particular [3] and [18].
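
The cyclic process described above can be illustrated with a minimal
Python sketch. It is not the implementation evaluated in this paper:
patterns are assumed to be the literal text between two entity
mentions, entities are assumed to come from small hypothetical name
lists, and confidence scoring is omitted entirely. The function names
and the toy corpus are illustrative assumptions.

```python
import re

def find_patterns(sentences, instances):
    """Collect the text between subject and object for each known instance."""
    patterns = set()
    for sent in sentences:
        for subj, obj in instances:
            m = re.search(re.escape(subj) + r"(.+?)" + re.escape(obj), sent)
            if m:
                patterns.add(m.group(1))
    return patterns

def find_instances(sentences, patterns, subjects, objects):
    """Use each pattern to nominate (subject, object) pairs of known names."""
    found = set()
    for sent in sentences:
        for pat in patterns:
            for subj in subjects:
                for obj in objects:
                    if subj + pat + obj in sent:
                        found.add((subj, obj))
    return found

def bootstrap(sentences, seeds, subjects, objects, max_cycles=3):
    """Alternate between learning patterns and learning instances."""
    instances, patterns = set(seeds), set()
    for _ in range(max_cycles):
        patterns |= find_patterns(sentences, instances)
        new = find_instances(sentences, patterns, subjects, objects) - instances
        if not new:  # stop once a cycle nominates nothing new
            break
        instances |= new
    return instances, patterns

# Toy corpus and name lists (hypothetical, for illustration only).
corpus = [
    "Microsoft has released Internet Explorer 11.",
    "Adobe has released Acrobat 11.",
    "Mozilla ships Firefox with a fix.",
    "Google ships Chrome weekly.",
    "Google has released Chrome 40.",
]
vendors = ["Microsoft", "Adobe", "Mozilla", "Google"]
products = ["Internet Explorer", "Acrobat", "Firefox", "Chrome"]
seeds = {("Microsoft", "Internet Explorer")}

instances, patterns = bootstrap(corpus, seeds, vendors, products)
```

Starting from a single seed, the first cycle learns one pattern and
two new instances; the second cycle learns a second pattern from one
of those instances, which in turn nominates a vendor that never
co-occurs with the first pattern. Without scoring, any spurious
pattern learned along the way would propagate in the same manner,
which is the drift that Section 2.3 addresses.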

2.1 Entity Extraction & Document Relevancy 

In order to extract a relation instance, the two entities involved
must be identified, and this task has previously been accomplished in
two ways. The first uses relation extraction patterns that also
identify entities; for instance, the initial work by Brin used
regular expressions as part of the patterns to identify potential
entities and relations simultaneously [4]. The second, common among
relation extraction algorithms (bootstrapping and otherwise), uses
named entity recognition (NER) tools to identify entities and
afterward proceeds with techniques to find entity pairs that share
the desired relation [2, 15, 16, 17]. Unfortunately, our experience,
as well as that of previous works, is that “off-the-shelf” NER tools
often fail to identify many cyber-security domain concepts. Freebase,
an online knowledge base, has been used for entity extraction and
disambiguation in previous works [15], and, more generally, calling
on public databases to aid information retrieval is present in
related works.

Table 1: Entity Types and Extraction Methods

Entity Type   Example(s)           Extraction Method
SW_Vendor     Adobe                Gazetteer (Freebase)
SW_Product    Acrobat              Gazetteer (Freebase)
SW_Version    7, 11.0.08           Regular Expressions
CVE_ID        CVE-2014-1127        Regular Expression
MS_ID         MS-14-011            Regular Expression
Vuln_Term     xss, sql injection   Gazetteer (handmade)
SW_Symbol     pAlloc(), reg.exe    Regular Expressions

Note: SW_Product includes operating systems and applications.
Vuln_Term entities are members of a hand-crafted list developed in
[3]; they are terms descriptive of vulnerabilities or attack
consequences and are similar to the “Attack” entity type of prior
work. SW_Symbol includes file names and named elements of code such
as functions, methods, and classes.


Table 2: Relation Types

#   Subject Entity   Relation         Object Entity   N
1   SW_Vendor        is_vendor_of     SW_Product      14
2   SW_Version       is_version_of    SW_Product      6
3   CVE_ID           CVE_of_vuln      Vuln_Term       1
4   MS_ID            MS_of_SW         SW_Product      6
5   MS_ID            MS_of_vuln       Vuln_Term       7
6   Vuln_Term        vuln_of_SW       SW_Product      14
7   SW_Symbol        symbol_of        SW_Product      2
8   SW_Version       not_version_of   SW_Product      9

Note: N denotes the number of seed patterns for each relation.


Following this line of thought, our implementation uses gazetteers
and regular expressions to label entities and discards documents
deemed irrelevant before the bootstrapping begins. Entity types with
examples and extraction methods are detailed in Table 1. Lists of
software vendors and software products were queried from the Freebase
API under the /computer/software-developer category and the
/computer/software and /computer/software/operating_system
categories, respectively. A major advantage of this approach is that
Freebase includes alternate names for entities, allowing easy
disambiguation; e.g., “IE” and “Internet Explorer” are both
identified as aliases of the same entity in Freebase’s database. In
order to discard documents unlikely to contain a relation before
further processing, a logistic regression classifier is trained on a
few documents hand-labeled as relevant or not. For our
implementation, the counts of each entity type appearing in the
document are the input features, although the framework is designed
to accommodate a more robust feature set if desired. Lastly, during
the bootstrapping process for relation extraction, if a known pattern
is identified with an unlabeled entity, that entity is then labeled.
Hence, in the learning of relations and patterns, the bootstrapping
process can aid in the identification of entities omitted by the
initial gazetteers or regular expression taggers.
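
A rough Python sketch of gazetteer and regular-expression tagging,
together with the per-type counts that serve as classifier features,
is given below. The name lists and expressions are small hypothetical
stand-ins, not the Freebase-derived lists or the expressions used in
our system, and the relevancy classifier itself is not shown.

```python
import re
from collections import Counter

# Hypothetical miniature gazetteers (assumption: the real lists are far larger).
GAZETTEERS = {
    "SW_Vendor": ["Adobe", "Microsoft"],
    "SW_Product": ["Acrobat", "Internet Explorer", "IE"],
    "Vuln_Term": ["xss", "sql injection"],
}

# Illustrative expressions only. The version expression requires at least one
# dot so that it does not fire on every bare number in the text.
REGEXES = {
    "CVE_ID": re.compile(r"\bCVE-\d{4}-\d{4,}\b"),
    "MS_ID": re.compile(r"\bMS-?\d{2}-\d{3}\b"),
    "SW_Version": re.compile(r"\b\d+(?:\.\d+)+\b"),
}

def tag_entities(text):
    """Return a list of (start, end, entity_type, surface_text), sorted by start."""
    spans = []
    for etype, names in GAZETTEERS.items():
        for name in names:
            pattern = r"\b" + re.escape(name) + r"\b"
            for m in re.finditer(pattern, text, re.IGNORECASE):
                spans.append((m.start(), m.end(), etype, m.group()))
    for etype, rx in REGEXES.items():
        for m in rx.finditer(text):
            spans.append((m.start(), m.end(), etype, m.group()))
    return sorted(spans)

def count_features(spans):
    """Count entities per type; these counts are the document's feature vector."""
    return Counter(etype for _, _, etype, _ in spans)

text = "Adobe fixed an XSS flaw (CVE-2014-1127) in Acrobat 11.0.08."
spans = tag_entities(text)
features = count_features(spans)
```

For the example sentence, each of the five entity types present is
counted once. A document whose counts are all zero carries no entity
pair and therefore cannot contain a relation instance, which is the
intuition behind using these counts to filter documents.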

2.2 Relations and Patterns 

The bootstrapping process is run independently for each of the 8
relation types listed in Table 2. Informed by research efforts to
build an ontology of the security domain [10], the first seven
relations correspond to attributes of the “Vulnerability” and
“Software” nodes of the ontology, while the last relation,
“not_version_of”, was added to indicate and prevent common errant
inferences of the opposite relation. For each relation, a few
hand-crafted seed patterns are given as input to start the
bootstrapping process, as reported in Table 2.

Our implementation involves three types of patterns: (1) all the
words/parts-of-speech in order between two appropriate entity types,
(2) a subset of contiguous words/parts-of-speech occurring between
two appropriate entity types, and (3) the parse tree path between two
appropriate entity types. A parse tree is an assignment of a tree
structure to each sentence with the words as leaf nodes, the words’
parts-of-speech as their parents, and other ancestral nodes
corresponding to the sentence’s structure as given by a context-free
grammar. By definition, there exists a unique path between any two
nodes in a tree; in particular, for two known entities in a sentence,
the parse tree path, known as a dependency path, gives a natural
feature set for deducing the existence of a relation between the
entities.


Figure: Example parse tree for the sentence “I like eggs.” For this
example, the pattern we extract from the parse tree path from “I” to
“eggs” is [N, NP, S, VP, N].
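
The path extraction for this example can be sketched in Python as
follows. The nested-tuple encoding and the exact tree shape are
assumptions chosen to be consistent with the pattern stated above; a
parser may produce a different tree for the same sentence.

```python
# Tree as nested tuples: (label, child, child, ...); a leaf word is a string.
TREE = ("S",
        ("NP", ("N", "I")),
        ("VP", ("V", "like"), ("N", "eggs")))

def path_to_word(node, word, index=0, trail=()):
    """Return ((child_index, label), ...) from the root down to the
    part-of-speech node directly above the first occurrence of `word`."""
    label, *children = node
    trail = trail + ((index, label),)
    for i, child in enumerate(children):
        if isinstance(child, str):
            if child == word:
                return trail
        else:
            found = path_to_word(child, word, i, trail)
            if found:
                return found
    return None

def tree_path(tree, word_a, word_b):
    """Labels along the unique path between the tags of two distinct words."""
    a = path_to_word(tree, word_a)
    b = path_to_word(tree, word_b)
    # Shared prefix of the two root paths; its last node is the lowest
    # common ancestor. Child indices keep same-labelled siblings apart.
    k = 0
    while k < min(len(a), len(b)) and a[k] == b[k]:
        k += 1
    up = [label for _, label in reversed(a[k:])]   # climb from word_a's tag
    top = a[k - 1][1]                              # lowest common ancestor
    down = [label for _, label in b[k:]]           # descend to word_b's tag
    return up + [top] + down

pattern = tree_path(TREE, "I", "eggs")
```

Because only node labels are kept, the same pattern is produced by
any sentence with this structure, regardless of the words involved.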



2.3 Scoring & Active Learning 

A common pitfall of bootstrapping algorithms is straying from the
desired topic by iteratively learning spurious patterns and
relations. To increase accuracy, a variety of scoring methods for
rating confidence in a nominated relation or pattern have been
proposed, and the common, underlying goals of scoring methods are (1)
to seek patterns that are indicative of the given relation but do not
occur with spurious instances, and (2) to seek relations that only
occur with trusted patterns. Additionally, active learning, which
refers to systems that query the user to provide accurate input on a
few pertinent examples, is incorporated into our system and has been
used in a few previous bootstrapping works. Discussion and
comparative evaluation of five bootstrapping scoring methods, in
particular in the presence of active learning, is the topic of [3].
As the BASILISK system of [18] showed the greatest benefit from user
interaction in that study, our scoring procedure is inherited from
their implementation, as described here.

Specifically, if a potential relation instance, $r$, is identified by
distinct patterns $p_1, \dots, p_n$, then
$\mathrm{R\_score}(r) := \sum_{i=1}^{n} \log(f_i + 1)/n$, where $f_i$
is the number of unique known relations identified by $p_i$. Thus,
relations identified by many successful patterns will score the
highest. If a potential pattern $p$ is nominated by occurring at
least once with unique known relations $r_1, \dots, r_m$, then
$\mathrm{P\_score}(p) := m \log(m)/N$, where $N$ is the number of
unique occurrences of $p$ (with or without a known relation). Hence,
the logarithm of the number of known instances the pattern matches is
weighted by the pattern’s precision, $m/N$, and patterns that both
match many known relations and have high precision obtain the highest
scores. To help prevent the system from drifting, an option for user
interaction is incorporated by allowing specification of the number
of queries per cycle. We note that the highest scoring patterns are
the ones that nominated the most relations and therefore have the
largest effect on the direction of the system; thus, for the
highest-scoring relations and patterns, the system asks a user to
verify their validity with “yes”, “no”, and “don’t know” response
options. Given a response of “yes” (“no”), the score is set to 1000
(-1). After the scoring, relations and patterns with scores in a
sufficiently high percentile are added to the set of relation/pattern
seeds for the next cycle. In the case of conflicting relation
nominations (e.g., “is_version_of” and “not_version_of”), the system
queries the user if possible; otherwise it defaults first to a
heuristic in some cases and then to the highest scoring relation.
Lastly, when a known pattern is found in the text with an unlabeled
entity, the system queries the user if possible before labeling the
entity.
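
The two scores and the effect of user responses can be sketched in
Python as follows. This is a minimal illustration under stated
assumptions, not our implementation: the natural logarithm is assumed
(a different base only rescales the scores), and the helper names
apply_feedback and accept_top, as well as the candidate names and
counts, are hypothetical.

```python
import math

def r_score(pattern_hits):
    """Score a candidate relation from f_1..f_n, where f_i is the number of
    known relations identified by the i-th pattern that nominated it."""
    n = len(pattern_hits)
    return sum(math.log(f + 1) for f in pattern_hits) / n

def p_score(m, total):
    """Score a candidate pattern that matched m known relations out of
    `total` (N) unique occurrences: precision m/N times log(m)."""
    return m * math.log(m) / total

def apply_feedback(scores, answers):
    """Override scores with user answers: 'yes' -> 1000, 'no' -> -1.
    Unqueried candidates and "don't know" answers keep their score."""
    fixed = {"yes": 1000.0, "no": -1.0}
    return {c: fixed.get(answers.get(c), s) for c, s in scores.items()}

def accept_top(scores, fraction):
    """Keep the highest-scoring fraction of candidates for the next cycle."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:math.ceil(fraction * len(ranked))]

# Three hypothetical candidate patterns with (m, N) counts.
scores = {
    "pat_a": p_score(4, 5),    # matches many known relations, precise
    "pat_b": p_score(2, 10),   # matches few known relations, imprecise
    "pat_c": p_score(1, 1),    # a single match scores log(1) = 0
}
reviewed = apply_feedback(scores, {"pat_a": "no", "pat_b": "yes"})
kept = accept_top(reviewed, 0.5)
```

In this toy run the user rejects the pattern that the formula ranked
first and confirms one that it ranked low, so the accepted set changes
from what the scores alone would have selected. This is the intended
role of the queries: a small number of answers on high-impact
candidates redirects the next cycle.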

3. RESULTS 

For initial testing, a corpus of 62 news articles, blogs, and updates
was compiled from a variety of security-related websites. After
pulling each site’s text with the Goose article extractor, word and
sentence tokenization and part-of-speech tagging are performed using
CoreNLP [13]. Only after the entity extractor provides entity labels
and the relevancy classifier discards irrelevant documents (see
Section 2.1) are parse trees generated by CoreNLP for the remaining
documents; 41 documents were kept, parsed, and used as the corpus for
bootstrapping. With an eye on scaling to a large corpus, this is a
noteworthy detail, as the parsing is computationally and temporally
expensive (e.g., parsing took about 300 of the 370 seconds required
by the NLP pipeline).

As no labeled corpus of relations for our domain was available,
evaluation is difficult; in particular, calculating recall would
involve labeling all the documents. To get a rough estimate of the
recall of the system, we hand-labeled a longer article, in which the
system correctly identified 8 of 33 total relations. Hence, at least
locally, the recall is 0.24. By manually checking each relation
instance the system output, we provide precision results in Table 3.
The reported run accepted the top 80% of patterns and relations after
each iteration of scoring and queried the user for the top 2%,
resulting in approximately 5 queries per iteration. Scores are
reported for output after 3 iterations through the corpus, after
which point the number of relations did not increase. Of the 41
relevant documents, 31 included an identified relation, and 186
relations were identified overall, of which 153 are correct. We note
that the seed patterns were crafted from example sentences observed
by the authors, and, hence, testing on a larger corpus will be needed
to flatten any bias that may have come from our observation of some
of the 62 documents.
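
The reported figures can be re-derived from the counts with a short
Python check. The dictionary simply restates the true and false
positive counts of Table 3 for the relations that had at least one
identified instance; the variable names are illustrative.

```python
# (true positives, false positives) per relation, restated from Table 3.
RESULTS = {
    "is_vendor_of": (45, 12),
    "is_version_of": (54, 19),
    "MS_of_vuln": (2, 0),
    "vuln_of_SW": (30, 2),
    "not_version_of": (22, 0),
}

def precision(tp, fp):
    """Fraction of identified relations that are correct."""
    return tp / (tp + fp)

per_relation = {name: round(precision(tp, fp), 2)
                for name, (tp, fp) in RESULTS.items()}

total_tp = sum(tp for tp, _ in RESULTS.values())
total_fp = sum(fp for _, fp in RESULTS.values())
overall = round(precision(total_tp, total_fp), 2)

# Local recall estimate from the single hand-labeled article.
recall_estimate = round(8 / 33, 2)
```

The totals come to 153 correct out of 186 identified relations, and
the rounded precision and recall agree with the values reported in
the text.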


Table 3: Results

#   Relation         TP    FP    P
1   is_vendor_of     45    12    0.79
2   is_version_of    54    19    0.74
3   CVE_of_vuln      -     -     -
4   MS_of_SW         -     -     -
5   MS_of_vuln       2     0     1.00
6   vuln_of_SW       30    2     0.94
7   symbol_of        -     -     -
8   not_version_of   22    0     1.00
    Totals           153   33    0.82

Note: True positives (TP), false positives (FP), and precision (P)
are reported by relation type and in total. Dashes indicate that no
instances were found in the corpus.

Text sources: www.arstechnica.com, http://blog.cmpxchg8b.com,
https://blog.malwarebytes.org, and www.threatpost.com.


4. CONCLUSION

In summary, our bootstrapping algorithm is a promising start on a
pertinent problem, namely, the need for automated information
extraction targeting security documentation. As our preliminary tests
involved a relatively small corpus, further testing on a larger
corpus is necessary. Additionally, comparative evaluation to quantify
the benefit of incorporating active learning is a desired future
direction. Ultimately, we plan to incorporate this work into a larger
pipeline that continually feeds it new documents from the web and
organizes the output into a database.